1 Morning

1.1 Review Example Amplicon Sequencing Results

To view an example result set, open a web browser to…

https://uic-ric.github.io/examples/amplicon/

If you obtained basic processing results from RIC, you should have the following files.

  • biom-summary.txt - Summary statistics of sequence table
  • report.html - Basic processing report
  • rep_set_sequences.zip - Compressed FASTA file of representative sequences for sequence units, e.g. ASVs or OTUs.
  • rep_set_tax_assignments.txt - Taxonomic assignments of the representative sequences
  • sequences.zip - ZIP archive of sequences for each sample, after merging, trimming and chimera checking
  • taxa_raw_counts.xlsx - Excel spreadsheet of taxonomic summaries from phylum to species level result - raw sequence counts
  • taxa_raw_counts.zip - ZIP archive of taxonomic summaries from phylum to species level result - raw sequence counts
  • taxa_relative.xlsx - Excel spreadsheet of taxonomic summaries from phylum to species level result - relative sequence abundance
  • taxa_relative.zip - ZIP archive of taxonomic summaries from phylum to species level result - relative sequence abundance
  • taxa_table.biom - Sequence/ASV table, in BIOM format

1.1.1 View basic report

  1. Click on the report.html to view the basic processing report.
  2. Observe that there is a set of tabs for the various steps in the basic processing of the data
    • Read merging - Merging read pairs into a single sequence
    • Sequence trimming - quality and primer trimming of merged sequences
    • Chimera checking - Chimera checking and removal of trimmed sequences
    • Read count simplification - Includes denoising or OTU clustering details
    • Taxonomic annotation - Generating taxonomic annotations for processed sequences
    • Data normalization - Steps to normalize, e.g. relative abundance or rarefaction, the raw counts data

1.1.2 View raw counts data

Raw sequence counts of the taxonomic summaries are provided in two formats

  • taxa_raw_counts.xlsx - As an Excel spreadsheet.
  • taxa_raw_counts.zip - ZIP archive of summaries in tab delimited (.txt) and BIOM formats.
  1. Go back in your browser to release view (https://uic-ric.github.io/examples/amplicon/)
  2. Click on the taxa_raw_counts.xlsx file to open in Excel.

The taxa_raw_counts.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. phylum to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the raw sequence counts in each sample.

1.1.3 View relative counts data

Prior to generating the relative sequence counts, the data are typically filtered to remove counts from chloroplast or mitochondria. The counts are then expressed as a fraction of the total for each sample, such that the total for each sample should be 1. Relative sequence counts of the taxonomic summaries are provided in two formats.

  • taxa_relative.xlsx - As an Excel spreadsheet.
  • taxa_relative.zip - ZIP archive of summaries in tab delimited (.txt) and BIOM formats.
  1. Go back in your browser to release view (https://uic-ric.github.io/examples/amplicon/)
  2. Click on the taxa_relative.xlsx file to open in Excel.

The taxa_relative.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. phylum to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the relative sequence counts in each sample.

NOTE: The normalization method used for these data is rather simplistic. For your own analyses you may need a more sophisticated normalization method.
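The per-sample scaling described above can be sketched in base R on a toy count matrix (the taxon and sample names below are made up for illustration):

```r
# Toy count matrix: rows are taxa, columns are samples
counts <- matrix(c(10, 30, 60,
                    5, 15, 80),
                 nrow = 3,
                 dimnames = list(c("taxonA", "taxonB", "taxonC"),
                                 c("sample1", "sample2")))

# Express each count as a fraction of its sample (column) total
rel <- apply(counts, 2, function(x) { x / sum(x) })

# Every column should now sum to 1
colSums(rel)
```

The same apply pattern scales to the full taxa table; any filtering of chloroplast or mitochondrial counts happens before this step.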

1.2 Using a BIOM file

Prior to these exercises, make sure you have completed the following.

  • If you do NOT have an RStudio project setup for the ‘workshop’ directory, create a new RStudio project.
  • Install the biomformat R package
  • Setup a new script to save the commands you will be running.

1.2.1 Create a new project in R Studio

  1. From File menu select New Project…
  2. From New Project Dialog select New Directory
  3. Select New Project as project type.
  4. In the Create New Project Dialog perform the following
    1. Give a project name, i.e. “Workshop”
    2. Use the Browse… button to set the Create project as subdirectory of to your Desktop
    3. Click Create Project button.

1.2.2 Install biomformat package

If you have not previously done so, install the biomformat package, which is available via Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomformat", update=F)

1.2.3 Create a new script.

  1. From the File menu select New File > R Script
  2. Save file as “my_Rscript.R”

1.2.4 Load a BIOM file.

  1. Open a web browser to the example dataset (https://uic-ric.github.io/examples/amplicon/)
  2. Click the taxa_table.biom file to download it and save it to the ‘workshop’ folder on your Desktop.
  3. In R, load the downloaded taxa_table.biom file.
# load biomformat library.  Allows reading of BIOM files
library(biomformat)

# Load BIOM file for the amplicon sequence table
a_biom <- read_biom("http://wd.cri.uic.edu/metagenomics/taxa_table.biom")

# What class of object is a_biom
class(a_biom)
## [1] "biom"
## attr(,"package")
## [1] "biomformat"
# Take a peek at the BIOM file
a_biom
## biom object. 
## type: OTU table 
## matrix_type: sparse 
## 475 rows and 20 columns

1.2.5 Access the counts table

# Get data in the BIOM file
a_data <- biom_data(a_biom)

# What class of object is a_data
class(a_data)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
# Let's peek at the data
a_data[1:10, 1:5]
## 10 x 5 sparse Matrix of class "dgCMatrix"
##            Sample10_M10_Fec2 Sample11_M12_Fec3 Sample12_M13_Fec1
## asv0000001              7492              5285              8579
## asv0000002              4185             10031              7125
## asv0000003              9262              3841              5786
## asv0000004                 .                 .                11
## asv0000005              2941              3794              2917
## asv0000006              4676              2004              2943
## asv0000007              1503              1399              1767
## asv0000008              4013              1795              2855
## asv0000009                35              1652              2608
## asv0000010              1686               609              1176
##            Sample13_M14_Fec3 Sample14_M15_Fec1
## asv0000001              5488              8666
## asv0000002              9272              6859
## asv0000003              4807              3415
## asv0000004                 .                 .
## asv0000005              3980              2903
## asv0000006              2459              1728
## asv0000007              3271               607
## asv0000008              2263              1616
## asv0000009               275              2810
## asv0000010               291               399

1.2.6 Access the metadata

# Get the observation metadata
obs_md <- observation_metadata(a_biom)

# What type of object is the obs_md
typeof(obs_md)
## [1] "list"
# Take a peek
obs_md[1:7]
## $asv0000001
##        taxonomy1        taxonomy2        taxonomy3        taxonomy4 
##       "Bacteria"   "Bacteroidota"    "Bacteroidia"  "Bacteroidales" 
##        taxonomy5 
## "Muribaculaceae" 
## 
## $asv0000002
##        taxonomy1        taxonomy2        taxonomy3        taxonomy4 
##       "Bacteria"   "Bacteroidota"    "Bacteroidia"  "Bacteroidales" 
##        taxonomy5 
## "Muribaculaceae" 
## 
## $asv0000003
##        taxonomy1        taxonomy2        taxonomy3        taxonomy4 
##       "Bacteria"   "Bacteroidota"    "Bacteroidia"  "Bacteroidales" 
##        taxonomy5 
## "Muribaculaceae" 
## 
## $asv0000004
##        taxonomy1        taxonomy2        taxonomy3        taxonomy4 
##       "Bacteria"   "Bacteroidota"    "Bacteroidia"  "Bacteroidales" 
##        taxonomy5 
## "Tannerellaceae" 
## 
## $asv0000005
##        taxonomy1        taxonomy2        taxonomy3        taxonomy4 
##       "Bacteria"   "Bacteroidota"    "Bacteroidia"  "Bacteroidales" 
##        taxonomy5 
## "Muribaculaceae" 
## 
## $asv0000006
##        taxonomy1        taxonomy2        taxonomy3        taxonomy4 
##       "Bacteria"   "Bacteroidota"    "Bacteroidia"  "Bacteroidales" 
##        taxonomy5 
## "Muribaculaceae" 
## 
## $asv0000007
##                       taxonomy1                       taxonomy2 
##                      "Bacteria"                    "Firmicutes" 
##                       taxonomy3                       taxonomy4 
##                    "Clostridia"                "Lachnospirales" 
##                       taxonomy5                       taxonomy6 
##               "Lachnospiraceae" "Lachnospiraceae NK4A136 group"

1.2.7 TAKE HOME: Generate genus level summary

The observation metadata is provided as a list of named vectors. We will convert this to a tibble (a data.frame-like object). A tibble can store a list in a column, and we can then use the unnest_longer function from the tidyr package to convert the list column into separate rows of the tibble.
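For intuition, the flattening that unnest_longer performs can be mimicked in base R on a tiny made-up list (toy_md below is a stand-in for obs_md; unnest_longer additionally keeps everything in a tibble):

```r
# Toy observation metadata: a named list of named taxonomy vectors
toy_md <- list(asv1 = c(taxonomy1 = "Bacteria", taxonomy2 = "Bacteroidota"),
               asv2 = c(taxonomy1 = "Bacteria", taxonomy2 = "Firmicutes"))

# One row per (observation, level) pair -- the base-R parallel of
# unnest_longer(obs_md, values_to = "taxon", indices_to = "level")
long <- data.frame(obs_id = rep(names(toy_md), lengths(toy_md)),
                   taxon  = unlist(toy_md, use.names = FALSE),
                   level  = unlist(lapply(toy_md, names), use.names = FALSE))

long
```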

  1. Load the dplyr and tidyr packages and create the initial tibble.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

obs_md_tbl <- tibble(obs_id=names(obs_md),
                     obs_md=obs_md) %>%
              unnest_longer(obs_md, values_to = "taxon", indices_to = "level")

# Take a peek at the output
obs_md_tbl
  2. Filter the table to obtain just the rows that have taxonomy entries and omit any rows with NA for the name
taxa_tbl <- subset(obs_md_tbl, grepl("^taxonomy[0-9]", level) & (! is.na(taxon)))

taxa_tbl
  3. Use the pivot_wider function in the tidyr package to convert to wide format. In any instance where a name is missing for that level, we will use the values_fill parameter to have pivot_wider fill with the value “Other”
taxa_tbl_w <- pivot_wider(taxa_tbl,
                          names_from="level",
                          values_from = "taxon",
                          values_fill = "Other")

taxa_tbl_w
  4. Join taxonomy1 to taxonomy6 into the L6 name for that row (observation). The unite function from the tidyr package allows us to quickly do that on a range of columns in our tibble
obs_names <- taxa_tbl_w %>% unite("L6", taxonomy1:taxonomy6, sep=";")

obs_names
  5. Merge obs_names and the BIOM data. (Alternatively, we could have reused the a_data object created previously.)
L6_data <- merge(obs_names[, c("obs_id", "L6")],
                 as.data.frame(as.matrix(biom_data(a_biom))),
                 by.x=1, by.y=0)

# Take a peek
head(L6_data)[,1:10]
  6. Generate the genus level summary. We will drop the first column (obs_id), then group by the L6 column and generate the sum for each sample, i.e. all other columns.
L6_sum <- L6_data[,-1] %>% group_by(L6) %>% summarise(across(everything(), sum))

L6_sum

1.3 Review Example Shotgun Metagenomic Sequencing Results

To view an example result set, open a web browser to…

https://uic-ric.github.io/examples/shotgun_metagenomics/

If you obtained basic processing results from RIC, you should have the following files.

  • report.html - Basic processing report
  • taxa_raw_counts.xlsx - Excel spreadsheet of taxonomic summaries from phylum to species level - raw sequence counts
  • taxa_raw_counts.zip - ZIP archive of taxonomic summaries from phylum to species level - raw sequence counts in TSV format
  • taxa_normalized.xlsx - Excel spreadsheet of taxonomic summaries from phylum to species level - normalized counts (CPM)
  • taxa_normalized.zip - ZIP archive of taxonomic summaries from phylum to species level - normalized counts (CPM) in TSV format
  • taxa_normalized_.xlsx - Excel spreadsheet of taxonomic summaries from phylum to species level for a given kingdom/domain - normalized counts (CPM)
  • taxa_normalized_.zip - ZIP archive of taxonomic summaries from phylum to species level for a given kingdom/domain - normalized counts (CPM) in TSV format
  • funl_table.txt - Functional gene counts table, in tab delimited format
  • funl_table.biom - Functional gene counts table, in BIOM format
  • funl_raw_counts.xlsx - Excel spreadsheet of functional gene summaries - raw sequence counts
  • funl_raw_counts.zip - ZIP archive of functional gene summaries - raw sequence counts in TSV format
  • funl_filtered_table.txt - Filtered functional gene counts table, in tab delimited format
  • funl_filtered_table.biom - Filtered functional gene counts table, in BIOM format
  • funl_filtered_raw_counts.xlsx - Excel spreadsheet of filtered functional gene summaries - raw sequence counts
  • funl_filtered_raw_counts.zip - ZIP archive of filtered functional gene summaries - raw sequence counts in TSV format

1.3.1 View basic report

  1. Click on the report.html to view the basic processing report.
  2. Observe that there is a set of tabs for the various steps in the basic processing of the data
    • Taxonomic annotation - Generating taxonomic annotations for reads
    • Functional gene annotation - Generating functional gene annotations for reads

1.3.2 View raw counts data

Raw sequence counts of the taxonomic summaries are provided in two formats

  • taxa_raw_counts.xlsx - As an Excel spreadsheet.
  • taxa_raw_counts.zip - ZIP archive of taxonomic summaries in tab delimited (.txt) format.
  1. Go back in your browser to the release view (https://uic-ric.github.io/examples/shotgun_metagenomics/)
  2. Click on the taxa_raw_counts.xlsx file to open in Excel.

The taxa_raw_counts.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. kingdom to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the raw sequence counts in each sample.

Raw sequence counts of the functional gene orthologs (KOs) and functional gene categories are provided both as all detected functional gene orthologs (funl_raw_counts) and filtered to remove reads as defined by the taxonomic filter in the report (funl_filtered_raw_counts).

1.3.3 View relative counts data

For the normalized data, the following steps are performed:

  1. The data are typically filtered to remove counts from chloroplast or mitochondria.
  2. The raw counts are also split into four (4) different kingdoms/domains, i.e. Bacteria, Archaea, Fungi and Viruses
  3. The raw counts for each of the four (4) kingdoms/domains, as well as the full data, are normalized as counts per million (CPM) of the total for each sample, such that the total for each sample should be 1,000,000.
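The CPM scaling in step 3 can be sketched in base R (toy numbers, for illustration only):

```r
# Toy count matrix: rows are taxa, columns are samples
counts <- matrix(c(200, 800,
                   300, 700),
                 nrow = 2,
                 dimnames = list(c("taxonA", "taxonB"), c("s1", "s2")))

# Counts per million: scale each sample so its total becomes 1e6
cpm_mat <- apply(counts, 2, function(x) { x / sum(x) * 1e6 })

# Each sample should now total 1,000,000
colSums(cpm_mat)
```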

Relative sequence counts of the taxonomic summaries are provided in two formats.

  • taxa_normalized.xlsx - As an Excel spreadsheet.
  • taxa_normalized.zip - ZIP archive of taxonomic summaries in tab delimited (.txt) format.
  1. Go back in your browser to the release view (https://uic-ric.github.io/examples/shotgun_metagenomics/)
  2. Click on the taxa_normalized_bacteria.xlsx file to open in Excel.

The taxa_normalized_bacteria.xlsx spreadsheet will have separate tabs for each taxonomic level, i.e. kingdom to species. In each sheet, the first column will be the taxonomic name and subsequent columns will have the normalized sequence counts (CPM) in each sample.

NOTE: The normalization method used for these data is rather simplistic. For your own analyses you may need a more sophisticated normalization method.

1.5 Review Example Shotgun Metagenomic Assembly Results

To view an example result set, open a web browser to…

https://uic-ric.github.io/examples/shotgun_metagenomics_assembly/

If you obtained basic processing results from RIC, you should have the following files.

  • report.html - Basic processing report
  • contigs_counts.txt - Read counts for each contig
  • contigs_data.txt - Details of assembled contigs, i.e. length, GC content, taxonomic annotation
  • bins_data.txt - Taxonomic annotations of metagenomic bins.
  • metagenomic_assembly.zip - Contigs from the assembly (zipped FASTA file)
  • orf_counts.txt - Read counts for each detected open reading frame (ORF), a.k.a. gene.
  • annotation_results.zip - ZIP archive of the annotation results.
    • contig_annotations.gff - GFF file of putative gene annotations for all contigs
    • predicted_proteins.faa - FASTA file of predicted amino acid sequences for all CDS features.

1.5.1 View basic report

  1. Click on the report.html to view the basic processing report.
  2. Observe that there is a set of tabs for the various steps in the basic processing of the data
    • Sequence trimming - Sequences are trimmed to remove adapters and low quality bases.
    • Denovo genome assembly using short read sequence data - Co-assembly of shotgun data
    • Contig annotation - Generating functional gene annotations for contigs
    • Binning assembled contigs - Contigs are binned based on abundance in the samples.

1.5.2 View read counts for each sample

Raw sequence counts for the metagenomic assembly are provided in contigs_counts.txt, which lists the number of mapped reads for each contig.

  1. Go back in your browser to the release view.
    • URL is (https://uic-ric.github.io/examples/shotgun_metagenomics_assembly/)
  2. Click on the contigs_counts.txt file to open it. The file is tab-delimited and can be imported into Excel.
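Tab-delimited files like this can also be loaded directly into R. The sketch below writes a tiny made-up stand-in for contigs_counts.txt to a temporary file and reads it back (the column names are illustrative, not the actual file layout):

```r
# Write a tiny tab-delimited stand-in for the counts file
tmp <- file.path(tempdir(), "contigs_counts.txt")
writeLines(c("contig\tsample1\tsample2",
             "contig_1\t120\t95",
             "contig_2\t40\t60"), tmp)

# read.delim handles tab-delimited text with a header row;
# row.names = 1 uses the contig IDs as row names
contig_counts <- read.delim(tmp, row.names = 1)
contig_counts
```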

1.5.3 View contig annotations

  1. Go back in your browser to the release view.
    • URL is (https://uic-ric.github.io/examples/shotgun_metagenomics_assembly/)
  2. Click on the contigs_data.txt file to open it.
    • The file is a tab-delimited file and can be imported into Excel.
    • This file contains the contig ID/name, length, GC content and taxonomic annotation.

Annotation of contig features (ORFs) is available in the contig_annotations.gff file contained in the annotation_results.zip archive.

2 Afternoon

For these exercises, we will be downloading files from the Internet into the workshop folder on your Desktop, then loading the files into R. Prior to these exercises, make sure you have completed the following. If you already created an RStudio project in the morning session, you do NOT need to repeat these steps.

  • If you do NOT have an RStudio project setup for the ‘workshop’ directory, create a new RStudio project.
  • Install the biomformat R package
  • Setup a new script to save the commands you will be running.

2.0.1 Create a new project in R Studio

  1. From File menu select New Project…
  2. From New Project Dialog select New Directory
  3. Select New Project as project type.
  4. In the Create New Project Dialog perform the following
    1. Give a project name, i.e. “Workshop”
    2. Use the Browse… button to set the Create project as subdirectory of to your Desktop
    3. Click Create Project button.

2.0.2 Create a new script.

  1. From the File menu select New File > R Script
  2. Save file as “my_Rscript.R”

2.0.3 Install biomformat package

If you have not previously done so, install the biomformat package, which is available via Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomformat", update=F)

2.1 Load BIOM file and generate basic statistics

# load biomformat library.  Allows reading of BIOM files
library(biomformat)

# Load BIOM file for amplicon sequencing genus level summary
L6_biom <- read_biom("http://wd.cri.uic.edu/metagenomics/taxa_table_L6.biom")

# Get data in BIOM file
L6_data <- biom_data(L6_biom)

# This is a taxonomic summary, so check the observation/feature names
row.names(L6_data)[1:10]
##  [1] "Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Muribaculaceae;Other"                          
##  [2] "Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Tannerellaceae;Other"                          
##  [3] "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnospiraceae NK4A136 group"   
##  [4] "Bacteria;Firmicutes;Bacilli;RF39;Other;Other"                                                  
##  [5] "Bacteria;Firmicutes;Clostridia;Oscillospirales;[Eubacterium] coprostanoligenes group;Other"    
##  [6] "Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae;Colidextribacter"              
##  [7] "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Other"                           
##  [8] "Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae;Intestinimonas"                
##  [9] "Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae;Other"                         
## [10] "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;[Eubacterium] xylanophilum group"
# Determine counts for all samples.
sample_counts <- apply(L6_data, 2, sum)

# View sample counts data
summary(sample_counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      31   45740   50428   49722   56674   79610
sample_counts[order(sample_counts)]
## Sample21_M25_Fec2 Sample22_M26_Fec1 Sample26_M33_Fec2 Sample23_M29_Fec3 
##                31             32343             39237             39305 
## Sample20_M24_Fec3 Sample27_M34_Fec3 Sample15_M16_Fec2 Sample29_M36_Fec3 
##             45393             45856             46029             48789 
## Sample19_M23_Fec2 Sample16_M18_Fec3 Sample28_M35_Fec3 Sample18_M20_Fec3 
##             49364             49988             50868             53641 
## Sample14_M15_Fec1 Sample11_M12_Fec3 Sample13_M14_Fec3 Sample12_M13_Fec1 
##             55243             55259             55619             59838 
## Sample24_M30_Fec3 Sample25_M32_Fec2 Sample10_M10_Fec2 Sample17_M19_Fec2 
##             60474             62840             64718             79610
hist(sample_counts)

2.2 Filtering and normalization

NOTE: If you were not able to complete the previous exercise, execute the following code to load example data.

load(url("http://wd.cri.uic.edu/metagenomics/metagenomics.RData"))

2.2.1 Filtering of chloroplasts and mitochondria

# Filter to remove chloroplast and mitochondrial data. First, see what the names are...
grep("Chloroplast", row.names(L6_data), value = T)
## character(0)
grep("Mitochondria", row.names(L6_data), value = T)
## [1] "Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Mitochondria;Other"
L6_data_clean <- L6_data[! grepl("Chloroplast", row.names(L6_data)), ]
L6_data_clean <- L6_data_clean[! grepl("Mitochondria", row.names(L6_data_clean)), ]

# Recompute counts for all samples.
clean_sample_counts <- apply(L6_data_clean, 2, sum)

# View sample counts data
summary(clean_sample_counts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      31   45740   50428   49722   56674   79610
clean_sample_counts[order(clean_sample_counts)]
## Sample21_M25_Fec2 Sample22_M26_Fec1 Sample26_M33_Fec2 Sample23_M29_Fec3 
##                31             32343             39237             39305 
## Sample20_M24_Fec3 Sample27_M34_Fec3 Sample15_M16_Fec2 Sample29_M36_Fec3 
##             45393             45856             46029             48789 
## Sample19_M23_Fec2 Sample16_M18_Fec3 Sample28_M35_Fec3 Sample18_M20_Fec3 
##             49362             49988             50868             53641 
## Sample14_M15_Fec1 Sample11_M12_Fec3 Sample13_M14_Fec3 Sample12_M13_Fec1 
##             55243             55259             55619             59838 
## Sample24_M30_Fec3 Sample25_M32_Fec2 Sample10_M10_Fec2 Sample17_M19_Fec2 
##             60474             62840             64718             79610
hist(clean_sample_counts)

# Show difference
clean_sample_counts - sample_counts
## Sample10_M10_Fec2 Sample11_M12_Fec3 Sample12_M13_Fec1 Sample13_M14_Fec3 
##                 0                 0                 0                 0 
## Sample14_M15_Fec1 Sample15_M16_Fec2 Sample16_M18_Fec3 Sample17_M19_Fec2 
##                 0                 0                 0                 0 
## Sample18_M20_Fec3 Sample19_M23_Fec2 Sample20_M24_Fec3 Sample21_M25_Fec2 
##                 0                -2                 0                 0 
## Sample22_M26_Fec1 Sample23_M29_Fec3 Sample24_M30_Fec3 Sample25_M32_Fec2 
##                 0                 0                 0                 0 
## Sample26_M33_Fec2 Sample27_M34_Fec3 Sample28_M35_Fec3 Sample29_M36_Fec3 
##                 0                 0                 0                 0

2.2.2 Perform rarefaction

We will need to install the vegan package, which provides a function to rarefy/subsample a table. If you are asked for a CRAN mirror site, please select one close to your location, e.g. “USA (IN)”.

install.packages("vegan")

# load the package
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7

From the previous steps we saw that all but one of our samples have at least 30,000 counts. So let’s perform a rarefaction at that depth.

# Perform a rarefaction at 30k
L6_rare <- t(rrarefy(t(as.matrix(L6_data_clean)), 30000))
## Warning in rrarefy(t(as.matrix(L6_data_clean)), 30000): some row sums < 'sample'
## and are not rarefied
summary(apply(L6_rare, 2, sum))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      31   30000   30000   28502   30000   30000
ncol(L6_rare)
## [1] 20
# Remove samples that were not rarefied
L6_rare <- L6_rare[, apply(L6_rare, 2, sum) == 30000]
summary(apply(L6_rare, 2, sum))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30000   30000   30000   30000   30000   30000
ncol(L6_rare)
## [1] 19
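For intuition, rrarefy subsamples each sample's reads without replacement down to the target depth. The same idea for a single sample can be sketched in base R (toy counts, made up for illustration):

```r
set.seed(42)  # make the subsampling reproducible

# Toy counts for one sample
taxon_counts <- c(taxonA = 500, taxonB = 300, taxonC = 200)

# Expand to one entry per read, draw 100 reads without replacement,
# then tally the drawn reads per taxon
reads <- rep(names(taxon_counts), taxon_counts)
rare  <- table(sample(reads, 100))

sum(rare)  # the rarefied depth
```

As in the exercise above, a sample whose total is below the target depth cannot be subsampled this way, which is why such samples are removed afterwards.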

2.2.3 Perform relative sequence count normalization.

# Filter to remove samples with low sequence counts
L6_data_clean <- L6_data_clean[, sample_counts > 5000]

# Generate relative sequence abundance 
L6_relative <- apply(L6_data_clean, 2, function(x) { x / sum(x) })

summary(apply(L6_relative, 2, sum))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

2.2.4 Filter to remove low abundance or sparse taxa.

  1. First we are going to plot all the taxa from largest to smallest total counts, colored by density (the number of samples in which each taxon is present), to determine a good filtering threshold.
# First generate a data frame with the total counts for each taxa and the density
plotdata <- data.frame(taxon=row.names(L6_data_clean),
                       counts=rowSums(as.matrix(L6_data_clean)),
                       density=apply(L6_data_clean, 1, function (x) { length(which(x > 0)) }))

# Sort from largest to smallest.
plotdata <- plotdata[ order(plotdata$counts, decreasing = T), ]

# Get a color for each possible density (number of samples)
density_pal <- rev(terrain.colors(ncol(L6_data_clean)))

# Set a color for each taxon based on its density.
plotdata$density_col <- factor(plotdata$density,
                               levels=seq(1, ncol(L6_data_clean)),
                               labels=density_pal)

# Plot the taxa counts and color by density
plot(plotdata$counts,
     col=as.character(plotdata$density_col),
     pch=16, log="y")

legend('topright', legend = seq(1, ncol(L6_data_clean)),
       col = levels(plotdata$density_col), pch = 16)

  2. Based on that plot, it would make sense to remove any taxa that are present in fewer than half of the samples and have fewer than 500 counts (that would average about 25 counts per sample if the taxon were present in all of the samples). We can replot, coloring by whether each taxon would be retained or not.
plotdata$retained <- plotdata$counts >= 500 &
                      plotdata$density >= floor(ncol(L6_data_clean) / 2)


# Plot the taxa counts and color by filter state
plot(plotdata$counts,
     col=ifelse(plotdata$retained, "blue", "red"),
     pch=16, log="y")

legend('topright', legend = c("retained", "filtered"),
       col = c("blue", "red"), pch = 16)

# We can also check the fraction of sequence data that will be retained
sum(plotdata$counts[ plotdata$retained ]) / sum(plotdata$counts)
## [1] 0.9957935
  3. Based on those results, we will use the following filtering thresholds
    1. Taxon must have at least 500 counts (summed across all samples)
    2. Taxon must be present in at least 50% of the samples (rounded down to the nearest integer)
# Get list of taxa that satisfy the filter
filtered_taxa <- plotdata$taxon[ plotdata$retained ]

# Filter the relative sequence table
L6_relative_filtered <- L6_relative[filtered_taxa, ]

# Compare the number of rows.
nrow(L6_relative)
## [1] 76
nrow(L6_relative_filtered)
## [1] 40
# Compare the filtered totals
summary(apply(L6_relative_filtered, 2, sum))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9867  0.9948  0.9970  0.9956  0.9977  0.9989

2.3 Basic visualizations

2.3.1 Generate a stacked barchart

# Get top 10 taxa
L6_top10 <- L6_relative_filtered[order(apply(L6_relative_filtered, 1, sum),
                                       decreasing=T)[1:10], ]

# stacked barplot of the top taxa
barplot(as.matrix(L6_top10), legend.text=TRUE)

# Clean up names (Optional)
# Strip off any "Other" names and split by semicolon
clean_names <- strsplit(gsub(";Other", "", row.names(L6_top10)), ";")
# Set prefixes for the different taxonomic levels
level_labels <- c("kingdom", "phylum", "class", "order", "family", "genus")
# Use sapply to generate a name using the prefix and the last valid name in the taxonomic name
clean_names <- sapply(clean_names, function (x) { paste(level_labels[length(x)], x[length(x)])})
# Set the names to the matrix
row.names(L6_top10) <- clean_names

# add better colors and rotate the sample labels
barplot(as.matrix(L6_top10), legend.text=T, col=rainbow(10), las=2)

# put the key outside the plot area
par(mar=c(15,5,5,15))  # put a bigger margin on the right and bottom
barplot(as.matrix(L6_top10), legend.text=T, col=rainbow(10), las=2,
   args.legend = list(x="right", inset=c(-0.7, 0)))

# note: call dev.off() to reset the plot parameters
dev.off()
## null device 
##           1

2.3.2 Generate heatmap

NOTE: If you have not previously installed ComplexHeatmap, perform the following to install the package via Bioconductor.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("ComplexHeatmap", update=F)
  1. First generate a heat map using the relative abundance data
# Generate heatmap
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.24.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
Heatmap(L6_top10, name="Rel. seq. abund.")

  2. Now, regenerate the heatmap using z-scored data
L6_top10_zscore <- t(scale(t(L6_top10)))
Heatmap(L6_top10_zscore, name="Z-score")
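The t(scale(t(x))) idiom works because scale standardizes columns; transposing twice makes it standardize each row (taxon) to mean 0 and standard deviation 1 across samples. A quick check on toy data:

```r
# Toy matrix: 3 taxa (rows) by 4 samples (columns)
x <- matrix(1:12, nrow = 3,
            dimnames = list(paste0("taxon", 1:3), paste0("s", 1:4)))

# scale() standardizes columns, so transpose, scale, transpose back
z <- t(scale(t(x)))

rowMeans(z)      # ~0 for every taxon
apply(z, 1, sd)  # ~1 for every taxon
```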

2.4 Perform differential analysis using edgeR

NOTE: If you have not previously installed edgeR, perform the following to install the package via Bioconductor.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("edgeR", update = F)
  1. Load the mapping file. For the exercises, we will be using a sample mapping file that we have posted online. For your own studies you will likely be using a mapping file on your computer.
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
  1. First, remove any samples with low counts, then subset the mapping file and the data so that both contain the same samples.
# Only retain samples with at least 5000 sequence counts
L6_filtered <- L6_data_clean[ , colSums(as.matrix(L6_data_clean)) >= 5000]

sel_samples <- intersect(row.names(mapping), colnames(L6_filtered))

mapping_subset <- mapping[sel_samples, , drop=F]
L6_subset <- L6_filtered[, sel_samples]
  1. Record the original sample sizes (after sample filtering but before taxon filtering)
sample_sizes <- apply(L6_subset, 2, sum)
total_sum <- sum(sample_sizes)
  1. Filter the table. Based on the previous filtering exercise we will use the following filtering thresholds.
    1. Taxon must have at least 500 counts (summed across all samples)
    2. Taxon must be present in at least 50% of the samples (rounded down to the nearest integer)
L6_data_filtered <- L6_subset[rowSums(as.matrix(L6_subset)) >= 500 &
                                apply(L6_subset, 1, function (x) { length(which(x > 0)) })
                                          >= floor(ncol(L6_subset) / 2), ]
  1. Load edgeR and perform the initial setup. We will include the original sample sizes
library(edgeR)
## Loading required package: limma
L6_dgelist <- DGEList(counts=L6_data_filtered,
                      lib.size=sample_sizes,
                      group=mapping_subset$Group)
  1. Compute the TMM normalization factors. NOTE: you can use other normalization methods and compare the results.
L6_dgelist <- calcNormFactors(L6_dgelist, method="TMM")
  1. Estimate dispersion and test for significance
# Estimate dispersions
L6_dgelist <- estimateDisp(L6_dgelist)
## Using classic mode.
L6_test <- exactTest(L6_dgelist)
  1. Build results table.
# run FDR correction and add to table
QValue <- p.adjust(L6_test$table$PValue, method="BH")
L6_diff <- cbind(L6_test$table,QValue)

# Sort by increasing p value
L6_diff <- L6_diff[order(L6_diff$PValue, decreasing = F), ]

# Peek at the results
head(L6_diff)
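p.adjust(method="BH") applies the Benjamini-Hochberg step-up procedure: scale each p value by n/rank, then enforce monotonicity from the largest p value downward. A minimal Python sketch of that procedure, purely illustrative (the R call above is what the exercise uses):

```python
# Benjamini-Hochberg step-up FDR adjustment, mirroring p.adjust(method="BH"):
# scale each p value by n/rank, then take running minima from the largest
# p value downward so adjusted values never decrease with rank.
def bh_adjust(pvalues):
    n = len(pvalues)
    # Indices sorted by p value, largest first
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    q = [0.0] * n
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = n - offset  # rank of pvalues[i] in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        q[i] = running_min
    return q
```

For example, bh_adjust([0.005, 0.5]) yields [0.01, 0.5]: the smallest p value is scaled by n/1 while the largest keeps its own value.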
  1. Save the results of the differential analysis and compute the normalized data
# Write results to a file.
write.table(L6_diff, file="diff_analysis.txt", sep="\t", quote=F, col.names = NA)
# Generate the normalized counts and store as an object in R
L6_cpm <- cpm(L6_dgelist)
# Write results to a file.
write.table(L6_cpm, file="norm.txt", sep="\t", quote=F, col.names = NA)

2.5 Perform group significance tests using Kruskal-Wallis test

  1. Read in the mapping file, then subset and sort it to the samples we are using
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
sel_samples <- intersect(row.names(mapping), colnames(L6_relative_filtered))

mapping_subset <- mapping[sel_samples, , drop=F]
L6_subset <- L6_relative_filtered[, sel_samples]
  1. Set up an empty data frame where we will store the results. This will have as many rows as the counts data.
L6_kw <- data.frame(matrix(nrow=nrow(L6_subset),
                           ncol=1 + length(unique(mapping_subset[,1])) ))
  1. We will use this vector to iterate over the groups for calculating means
groups <- unique(mapping_subset[,1])
  1. Set column names to have groups and then PValue
colnames(L6_kw) <- c(groups, "PValue")
  1. Set the row names to the taxa
row.names(L6_kw) <- row.names(L6_subset)
  1. In a loop, compute the mean abundance of each taxon per group and test with the Kruskal-Wallis test
for ( t in 1:nrow(L6_subset) ) {
    for ( g in 1:length(groups) ) {
        L6_kw[t,g] <- mean(L6_subset[t, mapping_subset[,1] == groups[g]])
    }
    kw_res <- kruskal.test(L6_subset[t, ], mapping_subset[,1])
    L6_kw$PValue[t] <- kw_res$p.value
}
  1. Calculate q values (FDR)
L6_kw$QValue <- p.adjust(L6_kw$PValue, method="BH")
  1. Sort by increasing p value
L6_kw <- L6_kw[ order(L6_kw$PValue, decreasing = F), ]
  1. Peek at the results
head(L6_kw)
  1. Write the results to a file
write.table(L6_kw, file="group_signif.txt", sep="\t", quote=F, col.names=NA)

2.6 Alpha Diversity Analyses

NOTE: If you have not previously installed the vegan library, install it via CRAN. If you are asked for a CRAN mirror site, please select one close to your location, e.g. “USA (IN)”.

install.packages("vegan")
  • Calculate Shannon (alpha) indices
  • Generate dot/box plot
  • Perform statistical test
  1. Read the mapping file, if you have not done so already
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
  1. Make sure vegan library is loaded
library(vegan)
  1. Calculate the Shannon index using the previously prepared rarefied table. NOTE: we could have subset the data to make sure the rarefied data and mapping file have the same samples. However, the merge command in the next step will do that for us.
alpha.div <- data.frame(H=diversity(L6_rare, index="shannon", MARGIN=2))
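The Shannon index computed by diversity(index="shannon") is H = -Σ p_i ln(p_i), where p_i is the relative abundance of each taxon within a sample (MARGIN=2 treats columns as samples). A minimal Python illustration of that formula:

```python
import math

# Shannon diversity H = -sum(p_i * ln(p_i)), computed from one sample's
# counts; p_i is the relative abundance of each taxon, and zero counts
# contribute nothing to the sum.
def shannon(counts):
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

# Four equally abundant taxa give the maximum H for 4 taxa: ln(4) ~= 1.386
h = shannon([25, 25, 25, 25])
```

H grows with both the number of taxa and the evenness of their abundances, which is why it is a useful single-number summary of within-sample diversity.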
  1. Merge the computed alpha diversity indices with the sample information (mapping table)
L6_alpha <- merge(mapping, alpha.div, by.x=0, by.y=0)
  1. Generate a boxplot of the alpha diversity indices as compared with the group.
boxplot(H ~ Group, data=L6_alpha, main="Alpha diversity",
        xlab="Group", ylab="Shannon index (H)")

  1. Test for differences in the computed alpha diversity indices using the Kruskal-Wallis test
kruskal.test(H ~ Group, L6_alpha)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  H by Group
## Kruskal-Wallis chi-squared = 8.5714, df = 1, p-value = 0.003415

2.7 Beta Diversity (Dissimilarity) Analyses

  • Calculate Bray-Curtis (beta) indices
  • Compute ADONIS & ANOSIM statistics
  • Generate NMDS/PCoA plot

2.7.1 Loading the data

  1. Load the sample mapping data
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)
  1. IF YOU DID NOT COMPLETE THE edgeR EXERCISE load the normalized data.
L6_cpm <- read.delim("https://wd.cri.uic.edu/metagenomics/norm.txt",
                     row.names=1, check.names=F)

2.7.2 Compute the beta diversity indices.

  1. Make sure vegan library is loaded
library(vegan)
  1. Ensure the mapping file has the same samples, in the same order, as the normalized data. NOTE: we do not need to subset both to ensure the same samples, as we previously subset the counts data when we ran the edgeR analysis.
mapping <- mapping[colnames(L6_cpm), , drop=F]
  1. Generate the dissimilarity matrix using the Bray-Curtis index. The square root transformation helps with data that exist on more of a logarithmic scale, but is a less “strong” transformation than taking logs.
beta_div <- vegdist(t(sqrt(L6_cpm)), method="bray")
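For each pair of samples, vegdist(method="bray") computes Σ|x_i − y_i| / Σ(x_i + y_i) over the taxa. A minimal Python sketch of that pairwise dissimilarity (the R call above applies it to every pair of columns after the square-root transform):

```python
# Bray-Curtis dissimilarity between two samples:
# sum(|u_i - v_i|) / sum(u_i + v_i), ranging from 0 (identical
# composition) to 1 (no shared taxa).
def bray_curtis(u, v):
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den
```

Two samples with no taxa in common score exactly 1, and identical samples score 0, which is what makes the index convenient for ordination.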

2.7.3 Test the dissimilarity

  1. Perform ADONIS/PERMANOVA test
adonis2(beta_div ~ Group, data=mapping)
  1. Perform the ANOSIM test. NOTE: make sure the groups are in the same order as the dissimilarity matrix
anosim(beta_div, mapping[labels(beta_div), "Group"])
## 
## Call:
## anosim(x = beta_div, grouping = mapping[labels(beta_div), "Group"]) 
## Dissimilarity: bray 
## 
## ANOSIM statistic R: 0.7605 
##       Significance: 0.002 
## 
## Permutation: free
## Number of permutations: 999

2.7.4 Generate NMDS plot

  1. Perform NMDS transformation
L6_nmds <- metaMDS(beta_div)
## Run 0 stress 0.06242569 
## Run 1 stress 0.06242581 
## ... Procrustes: rmse 0.0001126714  max resid 0.0002763492 
## ... Similar to previous best
## Run 2 stress 0.06242576 
## ... Procrustes: rmse 7.94217e-05  max resid 0.0001945191 
## ... Similar to previous best
## Run 3 stress 0.1312469 
## Run 4 stress 0.06242568 
## ... New best solution
## ... Procrustes: rmse 6.62879e-05  max resid 0.0001657766 
## ... Similar to previous best
## Run 5 stress 0.1150082 
## Run 6 stress 0.06242572 
## ... Procrustes: rmse 0.0001342272  max resid 0.0003267098 
## ... Similar to previous best
## Run 7 stress 0.06242576 
## ... Procrustes: rmse 0.0001356465  max resid 0.0003296287 
## ... Similar to previous best
## Run 8 stress 0.06242573 
## ... Procrustes: rmse 0.0001545498  max resid 0.0003798337 
## ... Similar to previous best
## Run 9 stress 0.06242582 
## ... Procrustes: rmse 0.0002411579  max resid 0.0005972389 
## ... Similar to previous best
## Run 10 stress 0.06242571 
## ... Procrustes: rmse 9.495894e-05  max resid 0.0002302727 
## ... Similar to previous best
## Run 11 stress 0.06242568 
## ... Procrustes: rmse 3.583144e-05  max resid 8.748141e-05 
## ... Similar to previous best
## Run 12 stress 0.0624258 
## ... Procrustes: rmse 0.0002258467  max resid 0.0005580313 
## ... Similar to previous best
## Run 13 stress 0.0624257 
## ... Procrustes: rmse 7.941642e-05  max resid 0.0001958895 
## ... Similar to previous best
## Run 14 stress 0.06242576 
## ... Procrustes: rmse 0.0001920973  max resid 0.0004720616 
## ... Similar to previous best
## Run 15 stress 0.06242573 
## ... Procrustes: rmse 0.0001085607  max resid 0.0002642562 
## ... Similar to previous best
## Run 16 stress 0.06242572 
## ... Procrustes: rmse 0.0001084033  max resid 0.0002664174 
## ... Similar to previous best
## Run 17 stress 0.0624257 
## ... Procrustes: rmse 9.684037e-05  max resid 0.0002545257 
## ... Similar to previous best
## Run 18 stress 0.06242586 
## ... Procrustes: rmse 0.0002762212  max resid 0.0006849466 
## ... Similar to previous best
## Run 19 stress 0.06242575 
## ... Procrustes: rmse 0.0001717  max resid 0.0004226907 
## ... Similar to previous best
## Run 20 stress 0.06242568 
## ... Procrustes: rmse 3.011759e-05  max resid 7.92814e-05 
## ... Similar to previous best
## *** Solution reached
  1. Set up colors for the groups.
# Get the groups in order of the items in the NMDS plot
colors <- factor(mapping[row.names(L6_nmds$points), "Group"])
group_names <- levels(colors)

# Then change the group names to colors
levels(colors) <- rainbow(length(levels(colors)))
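The factor trick above (replacing the levels of a factor with colors) amounts to a lookup table from group label to color. For readers less familiar with R factors, a hypothetical Python equivalent of the same idea:

```python
# Map group labels to plotting colors via a lookup table, the same idea as
# converting factor levels to rainbow() colors in the R code above.
def group_colors(groups, palette):
    levels = sorted(set(groups))  # unique group names, like factor levels
    lut = {g: palette[i % len(palette)] for i, g in enumerate(levels)}
    return [lut[g] for g in groups]
```

Each sample then carries the color of its group, so points in the ordination plot can be colored per group in one vectorized step.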
  1. Generate the NMDS ordination plot
plot(L6_nmds$points, col=as.character(colors))

  1. Change the symbols plotted to solid circles and add a legend
plot(L6_nmds$points, col=as.character(colors), pch=16)
# add a legend
legend('topright', legend = group_names,
       col = levels(colors), pch = 16)

  1. Generate a stressplot of NMDS results
stressplot(L6_nmds)

3 Take-home examples

The following are a set of additional examples that you can try on your own.

3.1 Pathway Enrichment in R

This is an example of performing a pathway enrichment analysis of differential results for a functional gene (KEGG ortholog) table. This example assumes that you have already performed the differential analysis, e.g. edgeR or ANCOM, and have the results of the differential analysis saved to a file or currently loaded in R.

This analysis could be performed on results from either shotgun metagenomics data or a PICRUSt inference from 16S data.

3.1.1 Data input (overview)

For this example of pathway enrichment of functional gene data you will need the following…

  • List of all orthologs in functional gene (KEGG ortholog) table.
  • List of differentially abundant orthologs, e.g. results from an edgeR test of the KO table
  • All the possible KEGG pathways of the orthologs. We will use the package KEGGREST to fetch this information from KEGG.

3.1.2 Enrichment test

The following steps are run in R.

  1. Install the KEGGREST package, if you have not done so already.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("KEGGREST", update=F)
  1. Load the data for the analysis.

NOTE: This example assumes that the results of the differential analysis (the input file) have the KEGG orthology ID (KO) in the first column and a column named QValue containing the FDR-corrected p values from the differential analysis. If your results differ, you will need to adjust how you generate the lists all_kos and diff_kos.

funl_results <- read.delim("https://wd.cri.uic.edu/metagenomics/diff_funl.txt", row.names=1)

# The tested KOs are in the row names
all_kos <- row.names(funl_results)

# The differential KOs are those with a QValue of at most 0.05.
diff_kos <- row.names(funl_results)[funl_results$QValue <= 0.05]

length(diff_kos)
## [1] 7439
length(all_kos)
## [1] 10624
  1. Load the pathway information for all of the KOs
library(KEGGREST)

# Store the number of KOs in a variable for ease of use
ko_len <- length(all_kos)

# This data frame will have the pathway mapping information
pathway_map <- data.frame(KO=character(), pathway=character())

# The KEGG API only lets one fetch 10 records at a time.
breaks <- seq(1,ko_len, by=10)

for (b in breaks ) {
  end <- min(b + 9, ko_len)
  these_kos <- all_kos[b:end]
  ko_data <- keggLink("pathway", these_kos)
  # Only retain the ko pathways.  KEGG also returns map* records that are the same.
  ko_data <- ko_data[grep("^path:ko", ko_data)]
  # Transform to data frame for easier selection
  ko_data <- data.frame(KO=sub("^ko:", "", names(ko_data)), pathway=ko_data)
  # Add to the existing pathway mapping
  pathway_map <- rbind(pathway_map, ko_data)
  # Sleep for a tenth of a second to not overwhelm KEGG servers
  Sys.sleep(0.1)
}
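The seq(1, ko_len, by=10) loop above batches the KO list into groups of ten so each keggLink call stays within KEGG's per-request limit. The batching bookkeeping itself can be sketched generically as:

```python
# Split a list of IDs into batches of at most `size` items, the same
# bookkeeping as the seq(1, ko_len, by=10) loop in the R code above.
def chunks(items, size=10):
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

The final batch is simply shorter when the list length is not a multiple of the batch size, matching the min(b + 9, ko_len) bound in the R loop.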

dim(pathway_map)
head(pathway_map)
## [1] 21318     2
  1. Filter the pathways to only focus on those with a minimum number of KOs
signif_pathways <- subset(pathway_map, KO %in% diff_kos)

# Get the counts of significant KOs for each of the pathways.
# You will likely want to exclude pathways with only one or a few significant KOs
signif_pathways_counts <- table(signif_pathways$pathway)

head(signif_pathways_counts)
## 
## path:ko00010 path:ko00020 path:ko00030 path:ko00040 path:ko00051 path:ko00052 
##           30           17           14           12           26           20
# Filter to only analyze pathways with more than three significant KOs. 
# Based on your results you may want to select a different threshold
signif_pathways_counts <- signif_pathways_counts[ signif_pathways_counts > 3 ]

head(signif_pathways_counts)
## 
## path:ko00010 path:ko00020 path:ko00030 path:ko00040 path:ko00051 path:ko00052 
##           30           17           14           12           26           20
  1. Run Fisher’s Exact Test

For this example, we will only use the first pathway from the filtered results. For your analyses, you would likely want to set up a loop to iterate over all of the pathways in the filtered list.

# Get one of the pathways.  
# For this example, we are skipping over all of the pathways with more than 50 entries.
# These tend to be very large groups and may not be that meaningful as pathways
filtered_pathways_counts <- sort(
    signif_pathways_counts[ signif_pathways_counts < 50 ], decreasing = T)
this_pathway <- names(filtered_pathways_counts)[1]

# Print the name of the pathway
this_pathway
## [1] "path:ko00230"
# Get a list of all the KOs in the pathway
kos_in_pathway <- pathway_map$KO[ pathway_map$pathway == this_pathway ]

# next compute the different parts for fisher's exact test
# size of intersection.  Number of differential KOs in this pathway
intersection <- length(intersect(diff_kos,kos_in_pathway))
# number of non-differential pathway KOs
other_pathway <- length(kos_in_pathway) - intersection
# number of differential KOs that are not in the pathway
other_diff <- length(diff_kos) - intersection
# all other genes in genome
other_kos <- length(all_kos) - intersection - other_pathway - other_diff
# format into contingency matrix
assoc <- matrix(c(intersection, other_pathway, other_diff, other_kos), nrow=2)
# Add column and row names to make a nice table display.  
# NOTE: This is not needed for the analysis
colnames(assoc) <- c("In pathway", "Other")
row.names(assoc) <- c("Differential", "No difference")
assoc
##               In pathway Other
## Differential          36  7403
## No difference        117  3068
# run fisher's exact test
fet <- fisher.test(assoc)
fet
## 
##  Fisher's Exact Test for Count Data
## 
## data:  assoc
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.08500913 0.18720334
## sample estimates:
## odds ratio 
##  0.1275471
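The four counts computed above partition all tested KOs into a 2×2 table: differential vs. not, and in-pathway vs. not. The same set arithmetic can be written compactly; the following is a hypothetical Python helper (contingency is not a function from the exercise) showing how the table cells relate:

```python
# Build the 2x2 contingency table for Fisher's exact test from the three
# sets involved: all tested KOs, the differential KOs, and the KOs in the
# pathway of interest. The four cells partition the full KO list.
def contingency(all_items, diff_items, pathway_items):
    diff, path = set(diff_items), set(pathway_items)
    a = len(diff & path)                 # differential, in pathway
    b = len(diff - path)                 # differential, not in pathway
    c = len(path - diff)                 # in pathway, not differential
    d = len(set(all_items)) - a - b - c  # everything else
    return [[a, b], [c, d]]
```

Because the four cells sum to the total number of tested KOs, Fisher's exact test can ask whether the pathway is over- or under-represented among the differential KOs.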

3.2 Run ANCOM test

3.2.1 Install ANCOMBC package

NOTE: You only need to do this once.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ANCOMBC", update=F)

3.2.2 Prepare the data for analysis

ANCOM expects the input data as a SummarizedExperiment or TreeSummarizedExperiment. The base sequence table (taxa_table.biom) can be loaded via the loadFromBiom function in the mia package (installed with the ANCOMBC package)

  1. Load the BIOM file
biom_tse <- loadFromBiom("http://wd.cri.uic.edu/metagenomics/taxa_table.biom")
## Warning in loadFromBiom("http://wd.cri.uic.edu/metagenomics/taxa_table.biom"):
## 'loadFromBiom' is deprecated. Use 'importBIOM' instead.
# Take a quick peek
biom_tse
## class: TreeSummarizedExperiment 
## dim: 475 20 
## metadata(0):
## assays(1): counts
## rownames(475): asv0000001 asv0000002 ... asv0000474 asv0000475
## rowData names(8): taxonomy1 taxonomy2 ... taxonomy7 taxonomy_unparsed
## colnames(20): Sample10_M10_Fec2 Sample11_M12_Fec3 ... Sample28_M35_Fec3
##   Sample29_M36_Fec3
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL
  1. In the BIOM file, the taxonomy levels are named taxonomy1 to taxonomy7. We will change these to the standard names for easier use with the ANCOMBC package
names(rowData(biom_tse)) <- c("Kingdom", "Phylum", "Class", "Order",
                        "Family", "Genus", "Species")
  1. Load the sample mapping file and add to the biom_tse object.
mapping <- read.delim("https://wd.cri.uic.edu/metagenomics/mapping.txt", row.names=1)

# Ensure the samples in the mapping data.frame are in the same order as the biom_tse object.
mapping <- mapping[colnames(biom_tse), , drop=F]

# Add to the biom_tse.  Will need to convert to DataFrame (a special class
# for SummarizedExperiment objects)
colData(biom_tse) <- DataFrame(mapping)

# Take a final peek at the object.
# 1. We should see that the rowData names are the taxonomic levels
# 2. The colData names should be "Group" (the experimental Group from the mapping file)
biom_tse
## class: TreeSummarizedExperiment 
## dim: 475 20 
## metadata(0):
## assays(1): counts
## rownames(475): asv0000001 asv0000002 ... asv0000474 asv0000475
## rowData names(8): Kingdom Phylum ... Species NA
## colnames(20): NA Sample11_M12_Fec3 ... NA.4 NA.5
## colData names(1): Group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: NULL
## rowTree: NULL
## colLinks: NULL
## colTree: NULL

3.2.3 Run ANCOM

  1. Run the ancombc2 function from the ANCOMBC library. We will run the analysis at the Genus level, computing adjusted p values using the Benjamini-Hochberg FDR correction. The fix_formula parameter sets the formula to fit to the data; as we only have a single factor (“Group”), the formula is just this term. The prv_cut parameter sets the fraction of samples in which a taxon must be present to be evaluated. In this instance, we set it to 0.5 (50%) to match the analyses in the Afternoon exercises.
ancom_res <- ancombc2(biom_tse,
                      tax_level="Genus",
                      p_adj_method = "BH",
                      fix_formula = "Group",
                      prv_cut=0.5)
## Checking the input data type ...
## The input data is of type: TreeSummarizedExperiment
## PASS
## Checking the sample metadata ...
## The specified variables in the formula: Group
## The available variables in the sample metadata: Group
## PASS
## Checking other arguments ...
## PASS
## Obtaining initial estimates ...
## Estimating sample-specific biases ...
## Loading required package: foreach
## Loading required package: rngtools
## ANCOM-BC2 primary results ...
## Conducting sensitivity analysis for pseudo-count addition to 0s ...
## For taxa that are significant but do not pass the sensitivity analysis,
## they are marked in the 'passed_ss' column and will be treated as non-significant in the 'diff_robust' column.
## For detailed instructions on performing sensitivity analysis, please refer to the package vignette.
  1. Get the results table from the ancom_res object.
ancom_diff <- ancom_res$res
  1. Get the columns that end with GroupTreatment. We tested the factor “Group”, which has two levels (“Control” and “Treatment”). The first level (Control) is equivalent to the intercept, and the computed statistics are relative to that level.
# Get the columns
ancom_diff <- ancom_diff[, c(1, grep("GroupTreatment$", colnames(ancom_diff)))]

# Strip the suffix from the column names
colnames(ancom_diff) <- sub("_GroupTreatment$", "", colnames(ancom_diff))

# Take a peek
head(ancom_diff)
  1. Write the clean diff table to a file.
write.table(ancom_diff, file="ancom_diff.txt", sep="\t", row.names=F, quote=F)
  1. Generate a volcano plot. We will plot the q value (FDR corrected p-value) as a function of the log2 fold change (lfc). The y axis will be plotted on a reverse log scale and points will be colored red if that taxon was considered significantly different.
plot(ancom_diff$lfc, ancom_diff$q,
     log="y", type="p", ylim=c(1, min(ancom_diff$q)),
     xlab="Log2 Fold Change (Treatment - Control)",
     ylab="FDR corrected p Value",
     col=ifelse(ancom_diff$diff, "red", "black"))

  1. Create barchart (with error bars) for significant taxa.
# Get the significant features
plotdata <- subset(ancom_diff, diff)

# Sort by lfc
plotdata <- plotdata[order(plotdata$lfc), ]

# Adjust the margins to increase the left margin
par(mar=c(5, 12, 4, 2))

# Generate the base bar plot and color by down vs. up
base_barplot <- barplot(plotdata$lfc,
                        names.arg = plotdata$taxon,
                        col=ifelse(plotdata$lfc < 0, "blue", "red"),
                        horiz = TRUE,
                        xlim=range( plotdata$lfc + ( sign(plotdata$lfc) * plotdata$se)),
                        las=1,
                        xlab="Log2 fold change")

# Use the arrows function to add the whiskers for the standard error
arrows(y0=base_barplot,
       x0=plotdata$lfc + plotdata$se,
       x1=plotdata$lfc - plotdata$se,
       angle=90,
       code=3,
       length=0.1)

3.3 Using QIIME2 for basic processing

The following is an example workflow for processing raw FASTQ data using QIIME2.

3.3.1 Importing demultiplexed FASTQ data into QIIME2

Import manifest

An import manifest was included with the example data. However, if you need to create your own import manifest for your own data, the manifest should have the following columns.

  • sample-id - Sample ID
  • forward-absolute-filepath - Absolute path to forward read (read 1) FASTQ file. Can use $PWD to denote current directory.
  • reverse-absolute-filepath - Absolute path to reverse read (read 2) FASTQ file. Can use $PWD to denote current directory.

The following is an example manifest file.

## sample-id      forward-absolute-filepath                    reverse-absolute-filepath
## Sample_001_P1  $PWD/Sample-001-P1_S61_L001_R1_001.fastq.gz  $PWD/Sample-001-P1_S61_L001_R2_001.fastq.gz
## Sample_002_P2  $PWD/Sample-002-P2_S62_L001_R1_001.fastq.gz  $PWD/Sample-002-P2_S62_L001_R2_001.fastq.gz
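For many paired FASTQ files, a small script can generate a manifest like the one above. The following Python sketch assumes Illumina-style names such as Sample-001-P1_S61_L001_R1_001.fastq.gz and derives the sample ID from the portion before the first underscore, with dashes converted to underscores; adjust both assumptions to your own file naming.

```python
import os

# Sketch: build a QIIME2 paired-end import manifest from a directory of
# Illumina-style paired FASTQ files ("_R1_"/"_R2_" in the names).
def write_manifest(fastq_dir, out_path):
    r1_files = sorted(f for f in os.listdir(fastq_dir)
                      if "_R1_" in f and f.endswith(".fastq.gz"))
    with open(out_path, "w") as out:
        out.write("sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n")
        for r1 in r1_files:
            # Reverse read: same name with _R1_ swapped for _R2_
            r2 = r1.replace("_R1_", "_R2_")
            # Sample ID: portion before the first underscore, dashes to underscores
            sample = r1.split("_")[0].replace("-", "_")
            out.write(f"{sample}\t$PWD/{r1}\t$PWD/{r2}\n")
```

The $PWD placeholder matches the example manifest above; QIIME2 expands it to the current directory at import time.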

Processing steps

  1. Import the FASTQ files.
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.txt \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2
## Imported manifest.txt as PairedEndFastqManifestPhred33V2 to paired-end-demux.qza
  1. Generate a visualization of the counts per sample
qiime demux summarize --i-data paired-end-demux.qza \
  --o-visualization paired-end-demux-summary.qzv
## Saved Visualization to: paired-end-demux-summary.qzv
  1. Either run qiime tools view paired-end-demux-summary.qzv or go to https://view.qiime2.org to view the visualization.

3.3.2 Trimming

  1. Run cutadapt to trim on quality, adapters, and minimum length, and to discard reads missing the adapters. NOTE: you will need to adjust the sequences of the forward (--p-front-f) and reverse (--p-front-r) primers based on the primers used, and adjust --p-minimum-length based on your sequencing parameters, e.g. 2x150 vs. 2x300. The following are the parameters for data using the 515F and 806R primers (V4 region) with 2x150 bp sequencing
qiime cutadapt trim-paired \
    --p-front-f ^GTGCCAGCMGCCGCGGTAA \
    --p-front-r ^GGACTACHVGGGTWTCTAAT \
    --p-error-rate 0.10 \
    --p-minimum-length 100 \
    --p-discard-untrimmed \
    --i-demultiplexed-sequences paired-end-demux.qza \
    --o-trimmed-sequences trimmed-reads.qza
## Saved SampleData[PairedEndSequencesWithQuality] to: trimmed-reads.qza
  1. Generate a visualization of the counts per sample
qiime demux summarize --i-data trimmed-reads.qza --o-visualization trimmed-reads-summary.qzv
## Saved Visualization to: trimmed-reads-summary.qzv
  1. Either run qiime tools view trimmed-reads-summary.qzv or go to https://view.qiime2.org to view the visualization.

3.3.3 DADA2 denoising

  1. Run dada2 to merge and denoise the sequence data. NOTE: you may need to adjust these parameters based on your particular data and sequencing setup.
qiime dada2 denoise-paired \
    --i-demultiplexed-seqs trimmed-reads.qza \
    --p-chimera-method consensus \
    --p-hashed-feature-ids \
    --p-trunc-len-f 0 --p-trunc-len-r 0 \
    --p-trunc-q 15 \
    --o-table dada2_table.qza \
    --o-representative-sequences dada2_rep_set.qza \
    --o-denoising-stats dada2_stats.qza
## Saved FeatureTable[Frequency] to: dada2_table.qza
## Saved FeatureData[Sequence] to: dada2_rep_set.qza
## Saved SampleData[DADA2Stats] to: dada2_stats.qza
  1. Export the stats artifact to a text file.
qiime tools export --input-path dada2_stats.qza --output-path dada2_stats
## Exported dada2_stats.qza as DADA2StatsDirFmt to directory dada2_stats
  1. View the stats file
less dada2_stats/stats.tsv
## sample-id input filtered percentage of input passed filter denoised merged percentage of input merged non-chimeric percentage of input non-chimeric
## #q2:types numeric numeric numeric numeric numeric numeric numeric numeric
## Sample_001_P1 4846 4104 84.69 4053 3725 76.87 3684 76.02
## Sample_002_P2 4857 4068 83.76 3991 3636 74.86 3594 74
## Sample_003_P3 4878 4098 84.01 4052 3663 75.09 3599 73.78
## Sample_007_C1 4865 4096 84.19 4017 3597 73.94 3515 72.25
## Sample_008_C1 4876 4075 83.57 4015 3602 73.87 3533 72.46
## Sample_009_C3 4862 4030 82.89 3926 3493 71.84 3398 69.89
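stats.tsv is a tab-separated file whose second line (#q2:types …) declares column types rather than data, so any programmatic reader must skip it. A minimal Python sketch of such a reader (a hypothetical helper, not part of QIIME2):

```python
# Read a QIIME2 metadata-style TSV, returning one dict per sample row and
# skipping the "#q2:types" column-type declaration line.
def read_qiime_tsv(path):
    rows = []
    with open(path) as fh:
        header = fh.readline().rstrip("\n").split("\t")
        for line in fh:
            if line.startswith("#"):  # the "#q2:types ..." line
                continue
            rows.append(dict(zip(header, line.rstrip("\n").split("\t"))))
    return rows
```

Keyed rows make it easy to, for example, flag samples whose percentage of reads passing the filter falls below a threshold.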
  1. Generate a visualization of DADA2 table
qiime metadata tabulate --m-input-file dada2_stats.qza \
  --o-visualization dada2_stats_table.qzv
## Saved Visualization to: dada2_stats_table.qzv
  1. Either run qiime tools view dada2_stats_table.qzv or go to https://view.qiime2.org to view the visualization.

3.3.4 Perform taxonomic annotation

  1. Download the appropriate QIIME2 Silva reference database file from https://docs.qiime2.org/2021.8/data-resources/.

  2. Run the feature classifier

qiime feature-classifier classify-sklearn \
    --i-reads dada2_rep_set.qza \
    --i-classifier silva-138-99-515-806-nb-classifier.qza \
    --o-classification rep_set_assignments.qza
## Saved FeatureData[Taxonomy] to: rep_set_assignments.qza
  1. Tabulate the assignment statistics
qiime metadata tabulate \
    --m-input-file rep_set_assignments.qza \
    --o-visualization assignments_stats.qzv
## Saved Visualization to: assignments_stats.qzv
  1. Either run qiime tools view assignments_stats.qzv or go to https://view.qiime2.org to view the visualization.

3.3.5 Filtering and generating genus level summary

  1. Filter feature table to remove ASVs associated with Chloroplast or Mitochondria
qiime taxa filter-table \
    --i-table dada2_table.qza \
    --i-taxonomy rep_set_assignments.qza \
    --p-exclude c__Chloroplast,f__mitochondria \
    --o-filtered-table dada2_table_clean.qza
## Saved FeatureTable[Frequency] to: dada2_table_clean.qza
  1. Generate summary of clean DADA2 table.
qiime feature-table summarize \
    --i-table dada2_table_clean.qza \
    --o-visualization dada2_table_clean_summary.qzv
## Saved Visualization to: dada2_table_clean_summary.qzv
  1. Either run qiime tools view dada2_table_clean_summary.qzv or go to https://view.qiime2.org to view the visualization.

  2. Generate a genus level summary of the clean DADA2 table

qiime taxa collapse \
    --i-table dada2_table_clean.qza \
    --i-taxonomy rep_set_assignments.qza \
    --p-level 6 \
    --o-collapsed-table genus_summary_clean.qza
## Saved FeatureTable[Frequency] to: genus_summary_clean.qza

3.3.6 Export QIIME2 data to BIOM files and load in R

  1. Export counts tables as BIOM files
qiime tools export \
    --input-path genus_summary_clean.qza \
    --output-path genus_summary_clean
## Exported genus_summary_clean.qza as BIOMV210DirFmt to directory genus_summary_clean
  1. Open R
R
  1. Execute the following in R
# load biomformat library.  Allows reading of BIOM files
library(biomformat)

# Load BIOM file for genus level summary
genus_biom <- read_biom("genus_summary_clean/feature-table.biom")

# NOTE! We have noticed some issues with the biomformat library.  
# If the previous read command fails with a message that includes the following...
#     `unable to translate 'lexical error: invalid char in json text.`
# Then use the following command to load the BIOM file.
genus_biom <- biom(read_hdf5_biom("genus_summary_clean/feature-table.biom"))

# Get data in BIOM file
genus_data <- biom_data(genus_biom)

# This is a taxonomic summary, so check the observation/feature names
row.names(genus_data)[1:10]

# Determine counts for all samples.
sample_counts <- apply(genus_data, 2, sum)

# View sample counts data
summary(sample_counts)
sample_counts[order(sample_counts)]
hist(sample_counts)